Extracting Support Data for a Given Task

Authors

  • Bernhard Schölkopf
  • Christopher J. C. Burges
  • Vladimir Vapnik
Abstract

We report a novel possibility for extracting a small subset of a data base which contains all the information necessary to solve a given classification task: using the Support Vector Algorithm to train three different types of handwritten digit classifiers, we observed that these types of classifiers construct their decision surface from strongly overlapping small (≈ 4%) subsets of the data base. This finding opens up the possibility of compressing data bases significantly by disposing of the data which is not important for the solution of a given task. In addition, we show that the theory allows us to predict the classifier that will have the best generalization ability, based solely on performance on the training set and characteristics of the learning machines. This finding is important for cases where the amount of available data is limited.

Introduction

Learning can be viewed as inferring regularities from a set of training examples. Much research has been devoted to the study of various learning algorithms which allow the extraction of these underlying regularities. No matter how different the outward appearance of these algorithms is, they all must rely on intrinsic regularities of the data. If the learning has been successful, these intrinsic regularities will be captured in the values of some parameters of a learning machine; for a polynomial classifier, these parameters will be the coefficients of a polynomial, for a neural net they will be the weights and biases, and for a radial basis function classifier they will be weights and centers. This variety of different representations of the intrinsic regularities, however, conceals the fact that they all stem from a common root.
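The overlap reported in the abstract can be quantified once the training-set indices of each classifier's support vectors are known. The following is a minimal sketch, not the authors' procedure: the helper names `overlap_fraction` and `summarize` are hypothetical, and the index sets are made up purely for illustration.

```python
from itertools import combinations


def overlap_fraction(a, b):
    """Fraction of the smaller set that also appears in the other set."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def summarize(sv_sets, n_train):
    """Return pairwise overlaps and the share of training data in the union."""
    overlaps = {
        (m, n): overlap_fraction(sv_sets[m], sv_sets[n])
        for m, n in combinations(sorted(sv_sets), 2)
    }
    union = set().union(*sv_sets.values())
    return overlaps, len(union) / n_train


# Made-up support vector indices (assumption); not results from the paper.
sv_sets = {
    "polynomial": {0, 3, 7, 12, 15},
    "rbf": {0, 3, 7, 12, 18},
    "neural_net": {0, 3, 7, 15, 18},
}
overlaps, kept_fraction = summarize(sv_sets, n_train=100)
```

Here `kept_fraction` is the share of the training set that would have to be retained if only the union of the support vector sets were kept, which is the kind of compression the abstract alludes to.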
In the present study, we explore the Support Vector Algorithm, an algorithm which gives rise to a number of different types of pattern classifiers. We show that the algorithm allows us to construct different classifiers (polynomial classifiers, radial basis function classifiers, and neural networks) exhibiting similar performance and relying on almost identical subsets of the training set, their support vector sets. In this sense, the support vector set is a stable characteristic of the data.

In the case where the available training data is limited, it is important to have a means for achieving the best possible generalization by controlling characteristics of the learning machine. We use a bound of statistical learning theory (Vapnik, 1995) to predict the degree which yields the best generalization for polynomial classifiers.

In the next Section, we follow Vapnik (1995), Boser, Guyon & Vapnik (1992), and Cortes & Vapnik (1995) in briefly recapitulating this algorithm and the idea of Structural Risk Minimization that it is based on. Following that, we will present experimental results obtained with support vector machines.

* Permanent address: Max-Planck-Institut für biologische Kybernetik, Spemannstraße 38, 72076 Tübingen, Germany.
† Supported by ARPA under ONR contract number N00014-94-G-0186.

The Support Vector Machine

Structural Risk Minimization

For the case of two-class pattern recognition, the task of learning from examples can be formulated in the following way: given a set of functions $\{f_\alpha : \alpha \in \Lambda\}$, $f_\alpha : \mathbb{R}^N \to \{-1, +1\}$ (the index set $\Lambda$ not necessarily being a subset of $\mathbb{R}^n$) and a set of examples $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_\ell, y_\ell)$, $\mathbf{x}_i \in \mathbb{R}^N$, $y_i \in \{-1, +1\}$, each one generated from an unknown probability distribution $P(\mathbf{x}, y)$, we want to find a function $f_{\alpha^*}$ which provides the smallest possible value for the risk $R(\alpha) = \int |f_\alpha(\mathbf{x}) - y| \, dP(\mathbf{x}, y)$. The problem is that $R(\alpha)$ is unknown, since $P(\mathbf{x}, y)$ is unknown.
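Although the risk cannot be evaluated without knowing the distribution, the same loss can be averaged over a finite sample. The following is a minimal sketch of that sample average, assuming a hypothetical linear sign classifier (`sign_classifier`) and a made-up toy sample; neither is taken from the paper.

```python
def empirical_risk(f, examples):
    """Average of |f(x) - y| over a finite sample of (x, y) pairs."""
    return sum(abs(f(x) - y) for x, y in examples) / len(examples)


def sign_classifier(w, b):
    """Hypothetical linear decision function with outputs in {-1, +1}."""
    def f(x):
        s = sum(wi * xi for wi, xi in zip(w, x)) + b
        return 1 if s >= 0 else -1
    return f


# Toy sample, made up for illustration (assumption).
examples = [
    ((1.0, 2.0), 1),
    ((-1.0, -0.5), -1),
    ((0.5, -2.0), -1),
    ((-2.0, 1.0), 1),
]
f = sign_classifier(w=(1.0, 1.0), b=0.0)
risk = empirical_risk(f, examples)
```

With labels in {-1, +1}, each misclassified example contributes 2 to the sum, so this average equals twice the error rate on the sample. It describes only the examples at hand and says nothing by itself about examples not yet seen.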
Therefore an induction principle for risk minimization is necessary.

From: KDD-95 Proceedings. Copyright © 1995, AAAI (www.aaai.org). All rights reserved.

Similar resources

Data Mining: Pattern Mining as a Clique Extracting Task

One of the important tasks in solving data mining problems is finding frequent patterns in a given dataset. It makes it possible to handle several tasks, such as pattern mining, discovering association rules, and clustering. There are several algorithms to solve this problem. In this paper we describe our task and results: a method for reordering a data matrix to give it a more informative form, problems o...

Prediction of chronic kidney disease in Isfahan with extracting association rules using data mining techniques

Background: Millions of deaths occur around the world each year due to lack of access to appropriate treatment for chronic kidney disease patients. Given the importance and mortality rate of this disease, early and low-cost prediction is very important. The researchers intend to identify chronic kidney disease through the optimal combination of techniques used in different stages of data mining...

UTD-SRL: A Pipeline Architecture for Extracting Frame Semantic Structures

This paper describes our system for the task of extracting frame semantic structures in SemEval–2007. The system architecture uses two types of learning models in each part of the task: Support Vector Machines (SVM) and Maximum Entropy (ME). Designed as a pipeline of classifiers, the semantic parsing system obtained competitive precision scores on the test data.

Shallow Information Extraction from Medical Forum Data

We study a novel shallow information extraction problem that involves extracting sentences of a given set of topic categories from medical forum data. Given a corpus of medical forum documents, our goal is to extract two related types of sentences that describe a biomedical case (i.e., medical problem descriptions and medical treatment descriptions). Such an extraction task directly generates m...

Investigating the relationship between dimensions of occupational, personal, support and task factors with professional learning activities of teachers

The present study aimed to investigate the relationship of the dimensions of professional, personal, support, and task factors with the professional learning activities of male secondary school teachers of the Urmia Education Department (District 1) during the academic year 2017-2018. This was an applied, descriptive-survey study in terms of its aim and method. The target population included 336 teachers out o...

Support vector regression for prediction of gas reservoirs permeability

Reservoir permeability is a critical parameter for characterization of the hydrocarbon reservoirs. In fact, determination of permeability is a crucial task in reserve estimation, production and development. Traditional methods for permeability prediction are well log and core data analysis which are very expensive and time-consuming. Well log data is an alternative approach for prediction of pe...

Publication date: 1995